Kruskal–Wallis H test (one-way ANOVA on ranks) — from scratch#
The Kruskal–Wallis test answers:
“Do two or more independent groups differ in their typical values when I don’t want to assume normality?”
It is a nonparametric alternative to one-way ANOVA. Instead of comparing means, it ranks all observations and checks whether some groups systematically receive higher (or lower) ranks.
Learning goals#
Know when Kruskal–Wallis is the right test (and when it isn’t).
Understand the null/alternative hypotheses and the assumptions.
Compute the H statistic from ranks and rank sums.
Implement the test with NumPy only (including tie correction).
Interpret results (p-value + effect size) and visualize what’s happening with Plotly.
Notebook roadmap#
What the test is used for
Hypotheses + assumptions
Intuition: “ANOVA on ranks”
The H statistic (with tie correction)
A tiny worked example (ranks by hand)
NumPy-only implementation
A realistic simulation + Plotly visuals
Tie correction demo
Interpretation + reporting
Pitfalls
Exercises + references
import numpy as np
import plotly.graph_objects as go
import os
import plotly.io as pio
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
SEED = 7
rng = np.random.default_rng(SEED)
import sys
import plotly
print("Python:", sys.version.split()[0])
print("NumPy:", np.__version__)
print("Plotly:", plotly.__version__)
print("Seed:", SEED)
Python: 3.12.9
NumPy: 1.26.2
Plotly: 6.5.2
Seed: 7
1) What is the Kruskal–Wallis test used for?#
Use Kruskal–Wallis when you have:
One categorical factor with 2+ independent groups (A/B/C/…)
A numeric or ordinal outcome
A reason to avoid assuming normality (e.g., skew or outliers)
Common situations:
A/B/C testing where the metric is skewed (e.g., time-on-page).
Comparing ratings (Likert-scale) across multiple groups.
Medical/biological measurements that are heavy-tailed or contain outliers.
Relationship to other tests:
One-way ANOVA: compares means assuming (roughly) normal residuals.
Mann–Whitney U: compares two independent groups using ranks.
Kruskal–Wallis is the natural extension to k groups.
Friedman test: the rank-based analogue for repeated-measures / paired designs (not Kruskal–Wallis).
2) Hypotheses and assumptions#
2.1 Hypotheses#
Let there be \(k\) groups.
Null hypothesis \(H_0\): the groups come from the same distribution (equivalently: group label has no effect on the outcome).
Alternative \(H_1\): at least one group tends to produce larger or smaller values.
Important nuance:
Kruskal–Wallis is often described as a test of “medians”, but strictly it detects distributional differences.
If group distributions have similar shapes/spreads, then a significant result is commonly interpreted as a difference in location (typical value).
2.2 Assumptions (practical)#
Observations are independent within and across groups.
The outcome is at least ordinal (ranks make sense).
Groups are independent (between-subject design).
If independence is violated, the p-value can be severely wrong.
3) Intuition: “ANOVA on ranks”#
Pool all observations from all groups.
Replace the raw values with their ranks (1 = smallest).
Compute each group’s sum of ranks (or mean rank).
If all groups really come from the same distribution, their ranks should be well-mixed and each group’s average rank should be close to the overall average rank:

$$\bar{R} = \frac{N + 1}{2},$$

where \(N\) is the total number of observations.
If one group tends to have larger values, it will accumulate larger ranks → the rank sums become very different → the test statistic grows.
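A minimal sketch of these three steps with toy numbers (ordinal ranks suffice here because this toy data has no ties):

```python
import numpy as np

groups = [np.array([2.0, 3.0, 7.0]), np.array([1.0, 8.0, 9.0])]
pooled = np.concatenate(groups)              # step 1: pool all observations
ranks = np.argsort(np.argsort(pooled)) + 1   # step 2: rank them (1 = smallest)
n1 = groups[0].size
# step 3: mean rank per group
mean_ranks = [float(ranks[:n1].mean()), float(ranks[n1:].mean())]
print(mean_ranks, "vs overall mean rank", (pooled.size + 1) / 2)
# → [3.0, 4.0] vs overall mean rank 3.5
```

Both group means sit near the overall mean rank of 3.5, so this toy data shows little rank-based signal.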
4) The H statistic (with tie correction)#
Let:
\(k\) = number of groups
\(n_i\) = sample size of group \(i\)
\(N = \sum_{i=1}^k n_i\) = total sample size
\(R_i\) = sum of ranks in group \(i\) (ranks computed on the pooled sample)
The Kruskal–Wallis statistic is:

$$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1).$$
Tie correction#
If there are ties in the pooled data, ranks are averaged. Ties reduce the variance of ranks, so we apply a correction:

$$C = 1 - \frac{\sum_j \left(t_j^3 - t_j\right)}{N^3 - N},$$

where \(t_j\) are the sizes of tied groups (e.g., if the value 10 appears 4 times, one of the \(t_j\) is 4).
The tie-corrected statistic is:

$$H_{\text{corr}} = \frac{H}{C}.$$
Under \(H_0\) and with sufficiently large samples, \(H_{\text{corr}}\) is approximately \(\chi^2\) distributed with \(k-1\) degrees of freedom.
In this notebook we compute H exactly (NumPy-only) and estimate the p-value via a permutation test (also NumPy-only).
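As a NumPy-only cross-check on the \(\chi^2\) approximation: for \(k = 3\) groups (df = 2), the \(\chi^2\) survival function has a closed form, \(P(X \ge h) = e^{-h/2}\), so an approximate p-value is available without SciPy. A minimal sketch (the helper name is ours):

```python
import numpy as np

def chi2_sf_df2(h: float) -> float:
    """Chi-square survival function for exactly 2 degrees of freedom.

    With df = 2 the chi-square distribution is exponential with mean 2,
    so P(X >= h) = exp(-h / 2) in closed form. Not valid for other df.
    """
    return float(np.exp(-h / 2.0))

# The tiny worked example below yields H = 5.6 with df = 2:
print(chi2_sf_df2(5.6))  # ≈ 0.0608, close to the permutation p-value
```

For other df you would need the regularized incomplete gamma function (e.g., `scipy.special.gammaincc`), which is why the permutation approach is the general NumPy-only route here.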
5) Tiny worked example (see ranks explicitly)#
We’ll use three small groups so we can see the mechanics clearly.
g1 = np.array([1, 3, 5])
g2 = np.array([2, 4, 6])
g3 = np.array([7, 8, 9])
groups_tiny = [g1, g2, g3]
group_names_tiny = ["A", "B", "C"]
def rankdata_average(x: np.ndarray):
"""Average ranks for ties (NumPy-only).
Returns
- ranks: float array of same shape as x (flattened)
- tie_counts: 1D int array containing sizes of tie groups (>1)
"""
x = np.asarray(x)
x = x.ravel()
n = x.size
if n == 0:
raise ValueError("x must be non-empty")
order = np.argsort(x, kind="mergesort")
sorted_x = x[order]
ranks_sorted = np.empty(n, dtype=float)
tie_counts = []
diffs = np.diff(sorted_x)
run_starts = np.r_[0, np.nonzero(diffs != 0)[0] + 1]
run_ends = np.r_[run_starts[1:], n]
for start, end in zip(run_starts, run_ends):
avg_rank = (start + 1 + end) / 2.0 # ranks are 1..n
ranks_sorted[start:end] = avg_rank
count = end - start
if count > 1:
tie_counts.append(count)
ranks = np.empty(n, dtype=float)
ranks[order] = ranks_sorted
return ranks, np.array(tie_counts, dtype=int)
all_values_tiny = np.concatenate(groups_tiny)
all_ranks_tiny, tie_counts_tiny = rankdata_average(all_values_tiny)
labels_tiny = np.concatenate([
np.repeat(name, len(g)) for name, g in zip(group_names_tiny, groups_tiny)
])
table = np.column_stack([labels_tiny, all_values_tiny.astype(str), all_ranks_tiny.astype(str)])
print("group value rank")
for row in table:
print(f"{row[0]:>5} {row[1]:>5} {row[2]:>4}")
print("\nTies:", tie_counts_tiny if tie_counts_tiny.size else "none")
group value rank
A 1 1.0
A 3 3.0
A 5 5.0
B 2 2.0
B 4 4.0
B 6 6.0
C 7 7.0
C 8 8.0
C 9 9.0
Ties: none
# Visualize mean ranks (this is what the test is sensitive to)
starts = np.r_[0, np.cumsum([len(g) for g in groups_tiny])[:-1]]
rank_sums_tiny = np.add.reduceat(all_ranks_tiny, starts)
ns_tiny = np.array([len(g) for g in groups_tiny])
mean_ranks_tiny = rank_sums_tiny / ns_tiny
overall_mean_rank_tiny = (len(all_ranks_tiny) + 1) / 2
fig = go.Figure(
data=[go.Bar(x=group_names_tiny, y=mean_ranks_tiny, text=np.round(mean_ranks_tiny, 2), textposition="outside")]
)
fig.add_hline(y=overall_mean_rank_tiny, line_dash="dash", line_color="gray", annotation_text="overall mean rank")
fig.update_layout(
title="Tiny example: mean ranks by group",
xaxis_title="Group",
yaxis_title="Mean rank",
yaxis=dict(range=[0, len(all_ranks_tiny) + 1]),
)
fig.show()
6) NumPy-only Kruskal–Wallis implementation#
We’ll implement:
computing ranks (average ties)
computing \(H\) and tie correction
estimating the p-value via a permutation test
The permutation test is conceptually simple:
Under \(H_0\) the group labels are exchangeable.
Shuffle which observations belong to which group (keeping group sizes fixed).
Recompute \(H\) each time.
The p-value is the fraction of shuffled datasets where \(H_{\text{perm}} \ge H_{\text{obs}}\).
def kruskal_wallis_h(*groups: np.ndarray):
"""Compute Kruskal–Wallis H statistic (tie-corrected), NumPy-only."""
if len(groups) < 2:
raise ValueError("Need at least two groups")
groups = [np.asarray(g).ravel().astype(float) for g in groups]
if any(g.size == 0 for g in groups):
raise ValueError("All groups must be non-empty")
ns = np.array([g.size for g in groups], dtype=int)
k = len(groups)
n_total = int(ns.sum())
pooled = np.concatenate(groups)
ranks, tie_counts = rankdata_average(pooled)
starts = np.r_[0, np.cumsum(ns)[:-1]]
rank_sums = np.add.reduceat(ranks, starts)
H = (12.0 / (n_total * (n_total + 1.0))) * np.sum((rank_sums**2) / ns) - 3.0 * (n_total + 1.0)
if tie_counts.size:
tie_term = np.sum(tie_counts**3 - tie_counts)
C = 1.0 - tie_term / (n_total**3 - n_total)
else:
C = 1.0
H_corr = H / C
mean_ranks = rank_sums / ns
df = k - 1
# A common effect size for Kruskal–Wallis
epsilon_sq = max(0.0, (H_corr - (k - 1)) / (n_total - k)) if n_total > k else np.nan
details = {
"H": float(H_corr),
"H_uncorrected": float(H),
"tie_correction_C": float(C),
"df": int(df),
"n_total": int(n_total),
"group_sizes": ns,
"rank_sums": rank_sums,
"mean_ranks": mean_ranks,
"epsilon_squared": float(epsilon_sq),
}
return details
def kruskal_wallis_permutation_test(*groups: np.ndarray, n_perm: int = 5000, seed: int = 0):
"""Permutation p-value for Kruskal–Wallis (NumPy-only).
Returns
- p_value
- H_obs
- H_perm: permutation distribution
"""
if n_perm <= 0:
raise ValueError("n_perm must be positive")
groups = [np.asarray(g).ravel().astype(float) for g in groups]
ns = np.array([g.size for g in groups], dtype=int)
if any(ns == 0):
raise ValueError("All groups must be non-empty")
pooled = np.concatenate(groups)
ranks, tie_counts = rankdata_average(pooled)
n_total = ranks.size
k = len(groups)
starts = np.r_[0, np.cumsum(ns)[:-1]]
# tie correction is constant under permutation (pooled values fixed)
if tie_counts.size:
tie_term = np.sum(tie_counts**3 - tie_counts)
C = 1.0 - tie_term / (n_total**3 - n_total)
else:
C = 1.0
rank_sums_obs = np.add.reduceat(ranks, starts)
H_obs = (12.0 / (n_total * (n_total + 1.0))) * np.sum((rank_sums_obs**2) / ns) - 3.0 * (n_total + 1.0)
H_obs /= C
rng_local = np.random.default_rng(seed)
H_perm = np.empty(n_perm, dtype=float)
for i in range(n_perm):
shuffled = rng_local.permutation(ranks)
rank_sums = np.add.reduceat(shuffled, starts)
H = (12.0 / (n_total * (n_total + 1.0))) * np.sum((rank_sums**2) / ns) - 3.0 * (n_total + 1.0)
H_perm[i] = H / C
# +1 correction avoids p=0 and is common for Monte Carlo/permutation tests
p_value = (1.0 + np.sum(H_perm >= H_obs)) / (n_perm + 1.0)
return float(p_value), float(H_obs), H_perm
details_tiny = kruskal_wallis_h(*groups_tiny)
p_tiny, H_obs_tiny, H_perm_tiny = kruskal_wallis_permutation_test(*groups_tiny, n_perm=5000, seed=SEED)
print("Kruskal–Wallis (tiny example)")
print("H (tie-corrected):", details_tiny["H"])
print("df:", details_tiny["df"])
print("Permutation p-value:", p_tiny)
print("Mean ranks:", np.round(details_tiny["mean_ranks"], 3))
print("Epsilon^2 (effect size):", np.round(details_tiny["epsilon_squared"], 4))
Kruskal–Wallis (tiny example)
H (tie-corrected): 5.600000000000001
df: 2
Permutation p-value: 0.05398920215956809
Mean ranks: [3. 4. 8.]
Epsilon^2 (effect size): 0.6
fig = go.Figure()
fig.add_trace(go.Histogram(x=H_perm_tiny, nbinsx=40, name="H under H0 (permutations)", opacity=0.75))
fig.add_vline(x=H_obs_tiny, line_color="crimson", line_width=3, annotation_text="observed H", annotation_position="top")
fig.update_layout(
title=f"Tiny example: permutation distribution of H (p ≈ {p_tiny:.4f})",
xaxis_title="H statistic",
yaxis_title="Count",
)
fig.show()
Interpreting the tiny example#
A small p-value means: datasets like this are rare under the null model where group labels do not matter.
Kruskal–Wallis is an omnibus test:
it can tell you “something differs”
it does not tell you which pairs differ (that requires post-hoc tests)
7) A more realistic example + Plotly visuals#
We’ll simulate three groups with skewed (log-normal-like) outcomes.
Group A and B have similar typical values
Group C is shifted upward
This setup is useful because:
one-way ANOVA relies on (roughly) normal residuals
Kruskal–Wallis is robust to skew/outliers because it works on ranks
n = 60
# Skewed data: exponentiate normal noise (log-normal-ish)
A = np.exp(rng.normal(loc=0.0, scale=0.6, size=n))
B = np.exp(rng.normal(loc=0.05, scale=0.6, size=n))
C = np.exp(rng.normal(loc=0.35, scale=0.6, size=n))
groups = [A, B, C]
group_names = ["A", "B", "C"]
# Violin plot of raw values (skew + group shift)
fig = go.Figure()
for name, values in zip(group_names, groups):
fig.add_trace(
go.Violin(
y=values,
name=name,
box_visible=True,
meanline_visible=True,
points="all",
jitter=0.25,
)
)
fig.update_layout(
title="Simulated skewed data: raw values by group",
xaxis_title="Group",
yaxis_title="Outcome value",
)
fig.show()
details = kruskal_wallis_h(*groups)
p_value, H_obs, H_perm = kruskal_wallis_permutation_test(*groups, n_perm=8000, seed=SEED)
print("Kruskal–Wallis (simulation)")
print("H (tie-corrected):", np.round(details["H"], 4))
print("df:", details["df"])
print("Permutation p-value:", np.round(p_value, 6))
print("Mean ranks:", np.round(details["mean_ranks"], 3))
print("Epsilon^2 (effect size):", np.round(details["epsilon_squared"], 4))
Kruskal–Wallis (simulation)
H (tie-corrected): 9.7284
df: 2
Permutation p-value: 0.007499
Mean ranks: [ 77.083 87.983 106.433]
Epsilon^2 (effect size): 0.0437
# Mean ranks summarize the rank shift between groups
overall_mean_rank = (details["n_total"] + 1) / 2
fig = go.Figure(
data=[
go.Bar(
x=group_names,
y=details["mean_ranks"],
text=np.round(details["mean_ranks"], 2),
textposition="outside",
)
]
)
fig.add_hline(y=overall_mean_rank, line_dash="dash", line_color="gray", annotation_text="overall mean rank")
fig.update_layout(
title="Simulation: mean ranks by group (rank-based signal)",
xaxis_title="Group",
yaxis_title="Mean rank",
)
fig.show()
# Visualize rank distributions directly
pooled = np.concatenate(groups)
ranks, _ = rankdata_average(pooled)
labels = np.concatenate([np.repeat(name, len(g)) for name, g in zip(group_names, groups)])
fig = go.Figure()
start = 0
for name, g in zip(group_names, groups):
end = start + len(g)
fig.add_trace(
go.Violin(
y=ranks[start:end],
name=name,
box_visible=True,
meanline_visible=True,
points=False,
)
)
start = end
fig.add_hline(y=(len(ranks) + 1) / 2, line_dash="dash", line_color="gray", annotation_text="overall mean rank")
fig.update_layout(
title="Simulation: rank distributions by group",
xaxis_title="Group",
yaxis_title="Rank (pooled)",
)
fig.show()
# Permutation distribution of H under the null
fig = go.Figure()
fig.add_trace(go.Histogram(x=H_perm, nbinsx=60, name="H under H0 (permutations)", opacity=0.75))
fig.add_vline(x=H_obs, line_color="crimson", line_width=3, annotation_text="observed H", annotation_position="top")
fig.update_layout(
title=f"Simulation: permutation distribution of H (p ≈ {p_value:.6f})",
xaxis_title="H statistic",
yaxis_title="Count",
)
fig.show()
8) Tie correction demo#
Ties happen when:
the measurement is discrete (e.g., integer ratings)
values are rounded
the data have many repeated values
Tie correction increases \(H\) slightly (because ties reduce rank variance).
A_tied = np.round(A, 1)
B_tied = np.round(B, 1)
C_tied = np.round(C, 1)
details_tied = kruskal_wallis_h(A_tied, B_tied, C_tied)
print("With ties (rounded to 0.1)")
print("H_uncorrected:", np.round(details_tied["H_uncorrected"], 4))
print("tie correction C:", np.round(details_tied["tie_correction_C"], 6))
print("H_corrected:", np.round(details_tied["H"], 4))
With ties (rounded to 0.1)
H_uncorrected: 9.8297
tie correction C: 0.99506
H_corrected: 9.8785
# Compare rank distributions when ties are introduced
pooled_tied = np.concatenate([A_tied, B_tied, C_tied])
ranks_tied, tie_counts = rankdata_average(pooled_tied)
fig = go.Figure()
start = 0
for name, g in zip(group_names, [A_tied, B_tied, C_tied]):
end = start + len(g)
fig.add_trace(go.Violin(y=ranks_tied[start:end], name=name, box_visible=True, meanline_visible=True))
start = end
fig.update_layout(
title=f"Ranks with ties present (number of tie groups >1: {len(tie_counts)})",
xaxis_title="Group",
yaxis_title="Rank",
)
fig.show()
9) How to interpret and report the result#
9.1 What does a significant result mean?#
If the p-value is below your significance level (e.g., \(\alpha=0.05\)), you reject \(H_0\).
Meaning: the data provide evidence that at least one group differs in its distribution (often interpreted as a location/median difference if shapes are similar).
9.2 What you should report#
sample sizes per group
the statistic: \(H\) (and whether tie-corrected)
degrees of freedom: \(k-1\)
p-value
an effect size (e.g., \(\varepsilon^2\) / epsilon-squared)
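For reference, the effect size computed in the implementation above is

$$\varepsilon^2 = \frac{H - (k - 1)}{N - k},$$

which rescales \(H\) by its degrees of freedom relative to the remaining sample size. Naming conventions vary in the literature; this same formula is sometimes labeled \(\eta^2_H\).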
Example write-up:
A Kruskal–Wallis test showed a difference between groups, \(H(2)=8.31\), \(p=0.015\), \(\varepsilon^2=0.12\).
9.3 Post-hoc comparisons#
Kruskal–Wallis does not tell you which groups differ. If it’s significant, follow up with pairwise tests (e.g., Mann–Whitney U) and a multiple-comparisons correction (Bonferroni/Holm/Benjamini–Hochberg).
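A minimal NumPy-only sketch of such a post-hoc routine (a hypothetical helper, not a tuned implementation): pairwise rank permutation tests with Holm step-down adjustment. For brevity it uses ordinal ranks; swap in `rankdata_average` from above if ties matter.

```python
import itertools
import numpy as np

def pairwise_rank_permutation(groups, names, n_perm=2000, seed=0):
    """Pairwise post-hoc comparisons via rank permutation tests,
    with Holm step-down correction of the p-values (sketch)."""
    rng = np.random.default_rng(seed)
    raw = []
    for (ia, a), (ib, b) in itertools.combinations(enumerate(groups), 2):
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        pooled = np.concatenate([a, b])
        ranks = np.argsort(np.argsort(pooled)) + 1.0  # ordinal ranks 1..n
        obs = abs(ranks[: a.size].mean() - ranks[a.size :].mean())
        exceed = 0
        for _ in range(n_perm):
            perm = rng.permutation(ranks)
            if abs(perm[: a.size].mean() - perm[a.size :].mean()) >= obs:
                exceed += 1
        raw.append((names[ia], names[ib], (exceed + 1) / (n_perm + 1)))
    # Holm: sort p ascending, multiply by (m - position), enforce monotonicity
    m = len(raw)
    order = sorted(range(m), key=lambda i: raw[i][2])
    adjusted = [0.0] * m
    running_max = 0.0
    for pos, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - pos) * raw[i][2]))
        adjusted[i] = running_max
    return [(na, nb, p, p_adj) for (na, nb, p), p_adj in zip(raw, adjusted)]
```

Usage would look like `pairwise_rank_permutation([A, B, C], ["A", "B", "C"])`, comparing each adjusted p-value against \(\alpha\).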
10) Pitfalls and diagnostics#
Independence is crucial: correlated/repeated measures invalidate the test.
Different shapes/spreads: a significant result may be driven by variance/shape differences, not just median shifts.
Many ties: use tie correction (built-in above). If ties are extreme, consider exact/permutation approaches.
Omnibus nature: significant does not identify which groups differ; use post-hoc tests.
Practical significance: always pair p-values with an effect size and plots.
11) Exercises#
Implement a pairwise post-hoc routine using permutation tests (with Holm correction).
Compare permutation p-values vs the \(\chi^2\) approximation for increasing sample sizes.
Construct a case where group medians are equal but spreads differ. Does Kruskal–Wallis reject?
References#
Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis.
Conover, W. J. (1999). Practical Nonparametric Statistics.
SciPy documentation for scipy.stats.kruskal (useful for verification).